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Continuous-time Markov processes over finite state-spaces are widely used to model dynamical processes in 
many fields of natural and social science. Here, we introduce an maximum likelihood estimator for constructing 
such models from data observed at a finite time interval. This estimator is dramatically more efficient than 
prior approaches, enables the calculation of deterministic conhdence intervals in all model parameters, and 
can easily enforce important physical constraints on the models such as detailed balance. We demonstrate 
and discuss the advantages of these models over existing discrete-time Markov models for the analysis of 
molecular dynamics simulations. 
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I. INTRODUCTION 


Estimating the parameters of a continuous-time 
Markov jump process model based on discrete-time ob¬ 
servations of the state of a dynamical system is a prob¬ 
lem which arises in many helds of science, including 
physics, biology, sociology, meteorology, and hnance.Ii® 
Diverse ^plications include the progression of credit risk 
spreads,social mobility,!^ and the evolution of DNA 
sequences in a phylogenetic tree.l^ In chemical physics, 
these models, also called master equations, describe first- 
order chemical kinetics, and are the principle workhorse 
for modeling chemical reactions.!^ 

For complex physical systems, the derivation of kinetic 
models from first principles is often intractable. In these 
circumstances, the parameterization of models from data 
is often a superior approach. As an example, consider 
the dynamical behavior of solvated biomolecules, such 
as proteins and nucleic acids. Despite the microscopic 
complexity of their equations of motion, relatively sim¬ 
ple multi-state kinetics often arise, as exemplihed by the 
ubiquity of two- and few-state Markov process models for 
protein folding.lSHSl 

Due in part to the unavailability of computationally ef¬ 
ficient and numerically robust estimators for continuous¬ 
time Markov models, in the field computational bio¬ 
physics, discrete-time Markov models have been widely 
used to fit and interpret the output of molecular dynam¬ 
ics (MD) simulations. Also called Markov state models 
(MSMs), these methods describe the molecular kinetics 
observed in an MD simulation as a jump process with 
a discr ete-ti me interval generally on the order of ~ 10 - 
100 ns. b^bs i These models provide convenient estimators 
for key quantities of interest for molecular systems, such 
as the free energies of various metastable conformational 
states, the timescales of their interconversion, and the 
dominant transition pathways.^1^^ 

In this work, we introduce an efficient maximum like¬ 
lihood estimator for continuous-time Markov models on 
a finite state space from discrete-time data. The source 
of data used here is identical to that employed in fitting 


discrete-time Markov chain models - namely, the number 
of observed transitions between each pair of states within 
a specified time interval. We demonstrate the properties 
of these models on simple systems, and apply them to the 
analysis of the folding of the FiP35 WW protein domain. 


II. BACKGROUND 


Consider a time-homogenous continuous-time Markov 
process {X{t) : t > 0} over a finite state space, y = 
{1, ..., n}. The process is determined completely by an 
n X n matrix K, variously called its rate matrix, in¬ 
finitesimal generator,substitution matrix,!^ or inten¬ 
sity matrix.^ 

For an interval t > 0, begin with the n x n matrix, 
T(t), of probabilities that the process jumps from one 
state, i, to another state, j, 


T(t),, =P(A(t + r)=j| A(f)=*), (1) 


which, by time-homogeneity is assumed to be indepen¬ 
dent of t. The process’s rate matrix, K, is dehned as 


K= lim 

T—V0+ 


T(t) - I„ 
r 


( 2 ) 


Given K and any time interval, r, the transition prob¬ 
ability matrix, T(t), can be expressed a matrix exponen¬ 
tial 


Z "t^Z 

T(t) = exp(Kr) = ^ . (3) 

i=o *■ 

A particular rate matrix K corresponds to a valid 
continuous-time Markov process if and only if its off- 
diagonal elements are nonnegative and its row sums equal 
zero. These constraints are necessary to ensure that the 
probabilities propagated by the dynamics remain positive 
and sum to one. We denote by this set of admissible 
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rate matrices, 


jr=<^K = {%} e : fcy > 


^ii — 

Furthermore, we denote by ,3^ the set of all embeddable 
transition probability matrices, that is, those which could 
originate as the transition probability matrix, T(t), in¬ 
duced by some continuous-time Markov process, 

^ = {TGM”^”:3KG^s.t.T = exp(K)}. (5) 

It is well-known that set ^ is a strict subset of the set 
of all stoch astic m atrices; not all stochastic matrices are 
embeddable P^ I ^^I A complete description of the topologi¬ 
cal structure of ^ as well as the necessary and sufficient 
conditions for a stochastic matrix to be embeddable are 
open problems in the theory of Markov processes. 

Although Eq. § serves as the definition of the rate 
matrix of a continuous-time Markov process, it is gener¬ 
ally not directly suitable as a method for parameterizing 
Markov models, particularly for applications in chemi¬ 
cal kinetics. The attempt to numerically approximate 
the limit in Eq. ([^ from empirically measured transi¬ 
tion probabilities would be valid if the generating pro¬ 
cess were exactly Markovian. However, in chemical ki¬ 
netics, a Markov process model - the chemical master 
equation - is an approximation valid onl y for timescales 
longer than the molecular relaxation timepSl^ A suitable 
Markov model which is predictive over long timescales 
must capture both the instantaneous kinetics as well as, 
to use the vocabulary of Mori-Zwanzig formalism, t he ef- 
fective contribution of the integrated memory kernel.^^Ulll 

Our goal is to address this parameterization problem. 
The primary contribution of this work is an efficient algo¬ 
rithm for estimating K from observed discrete-time ob¬ 
servations. We adopt a direct maximum likelihood ap¬ 
proach, with O(n^) work per iteration. Many constraints 
on the solution, such as detailed balance or specific spar¬ 
sity patterns on K can be introduced in a straightforward 
manner without additional cost. 

Prior work on this subject is numerous. Crommelin 
and Vanden-Eijnden proposed a method for estimating 
K in which a discrete-time transition probability matrix 
is first fit to the observed data, followed by the determi¬ 
nation of the rate matrix, K such that exp(KT) is ne arest 
to the target empirical transition probability matrix.1221211 
The nature of this calculation depends on the norm used 
to define the concept of “nearest”: under a Frobenius 
norm, this problem has a closed form solution, while 
the norm of Crommelin and Vanden-Eijnden leads to a 
quadratic program. A similar approach was advocated 
by Israel et al® 

Kalbfleisch and Lawless proposed a maximum likeli¬ 
hood estimator for K.l^ Without constraints on the rate 


0 for all i ^ j, (4) 

— ^ 


matrix, their proposed optimization involves the con¬ 
struction and inversion of an x Hessian matrix 
at each iteration of the optimization, rendering it pro¬ 
hibitively costly (0(n®) scaling per iteration) for moder¬ 
ate to large state spaces. 

A series of expectation maximization (EM) algorithms 
are described by Asmussen, Nerman and Olsson, Holmes 
and Rubin, Bl adt and Sprensen, and Hobolth and 
jenseiP^ESHlIll] xhese algorithms treat the state of the 
system between observation intervals as an unobserved 
latent variable, which when interpolated via EM leads to 
more efficient estimators. A review of these algorithms 
is presented by Metzner et al.^^ At best, each iteration 
of the proposed methods scales as 0{n^). 


III. MAXIMUM LIKELIHOOD ESTIMATION 
A. Log-likelihood and gradient 

We take our source of data to be one or more observed 
discrete-time trajectories from a Markov process, x = 
■ ■ ■ ,xnt}, in a finite state space, observed at a 
regular time interval. 

The likelihood of the data given the model and the 
initial state is given in terms of the transition probabil¬ 
ity matrix as the product of the transition probabilities 
assigned to each of the observed jumps in the trajectory. 


N-l 

P(a;|K,a:o) = . (6) 

k=0 

When more than one independent trajectory is observed, 
the data likelihood is a product over trajectory with in¬ 
dividual terms given by Eq. §. 

Because many transitions are potentially observed 
multiple times, Eq. § generally contains many repeated 
terms. Define the observed transition count matrix 
C(t) G 

Af-1 

^ ^ — ‘^) ■ ~ j)- ("f) 

k=0 

Collecting repeated terms, the likelihood can be rewritten 
more compactly as 

P(a:|K,xo) = nT(r)^^"^'A (8) 


Suppose that the rate matrix, K is parameterized by 
a vector, 0 G of independent variables, K = K(0). In 
the most general case, every element of the rate matrix 
may individually be taken as an indep endent variable, 
with b = 11 ? — n. As discussed in Section HIB other pa- 
rameterizations may be used to enforce certain properties 













on K. The logarithm of data likelihood is 


( 9 ) 

( 10 ) 


complexity of constructing the gradient vector to 0{n^) 
FLOPS. 


£(d;T) = lnP(x|K(0),cro), 
ij' 

= X] (c(t) olnexp 


( 11 ) 


where ln(X) is the element-wise natural logarithm, 
exp(X) matrix exponential, and X o Y is the Hadamard 
(element-wise) matrix product. Note that the element¬ 
wise logarithm and matrix exponential are not inverses 
of one another. 

The most straightforward parameter estimator - the 
maximum likelihood estimator (MLE) - selects parame¬ 
ters which maximize the likelihood of the data, 

= argmax £(0; r). (12) 
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To maximize Eq. (12), we focus our attention on quasi- 


Newton optimizers that utilize the first derivatives of 
C{9\t) with respect to 9. This requires an efficient al¬ 
gorithm for computing Ve£(0;r). We achieve this by 
starting from the eigendecomposition of K, 


K = Vdiag(A)U'^, 


(13) 


where the columns of U and V contain the left and right 
eigenvectors of K respectively, jointly normalized such 
that = U^, and A are the corresponding eigenval¬ 
ues. Assuming that K has no repeated eigenvalues, the 
directional derivatives of induce d tran sition probability 
matrix, 9T(r)ij/i90„ are given 

dT{T),,/d9u = V ((U^(5K/90„)V) o X(A, t)) U^, 

(14) 


where X(A, t) is an n x n matrix with entries 


rva _J^exp(rA,), 

< exp(rAi) —exp(rAj) 
I Ai-A, 


^=3, 

j- 


(15) 


The elements of the gradient of the log-likelihood can 
then be constructed as 


dC{9\ t) 


= X] (Dov( (U^(9K/90„)V) oX(A,t))u' 


(16) 


where D^- = C(T)ij/Ty. 

A direct implementation of Eq. (161 requires at least 
4 n X n matrix multiplies for each element of 9, indexed 
by u. If the parameter vector, 0, contains 0{‘n?) parame¬ 
ters, then computing the full gradient will require 0(n®) 
floating point operations (ELOPs). However, two prop¬ 
erties of the Hadamard product and matrix trace can 
be exploited to dramatically reduce the computational 


^(AoB),,=Tr(AB^), (17) 

Tr(A^(BoC)) =Tr(B^(AoC)) . (18) 


Using these identities, the gradient of the log-likelihood 
can be rewritten as 


9£(0; r) 
50„ 


[ 9K/d0„o (u((V'^DU)oX(A,t))v^) 

ij \ ^^-^ 

Z 


(19) 


Note that because Z is independent of u, it can be 
constructed once at the beginning of a gradient calcula¬ 
tion at a cost of 0{n^) ELOPs, and reused for each in¬ 
dex, u. The remainder of the work involves constructing 
the derivative matrix i9K/50„, which is generally quite 
sparse, and a single inexpensive sum of a Hadamard prod¬ 
uct. Overall, this rearrangement reduces the complexity 
of constructing the full gradient vector from 0{n^) to 
0{v?) FLOPS. 


B. Reversible parameterization 

In the application of these models to domain-specific 
problems, additional constraints on the Markov process 
may be known, and enforcing these constraints during 
parameterization can enhance the interpretability of so¬ 
lutions as well as provide a form of regularization. 

For many molecular system, it is known that the un¬ 
derlying dynamics are reversible, and this property can 
be enforced in Markov models as well. A Markov pro¬ 
cess is reversible when the rate matrix, K, satisfies the 
detailed balance condition with respect to a stationary 
distribution, tt, towards which the process relaxes over 
time. 


ttK = 0, (20) 

TTikij = TTjkji y i ^ j- (21) 


This constraint can be enforced on solutions through 
the design of the parameterization function, K(0). If 
>K is reversible, Eq. ( |21[ ) implies that a real symmetric 
n X n matrix, S, can be formed, which we refer to as the 
symmetric rate matrix, such that 


S = = diag( v^) K diag( a/^) 


-1 


( 22 ) 


Because of this symmetry and the constraint on the 
row sums of K, only the upper triangular (exclusive of 
the main diagonal) elements of S, and the stationary vec¬ 
tor, TT, need be directly encoded by the parameter vector, 
0, to fully specify K. Furthermore, since the elements 
of TT are constrained to be positive, working with the 
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element-wise logarithm of tt can enhance numerical sta¬ 
bility. For the elements of S, which are only constrained 
to be nonnegative, the same logarithm transformation 
is inapplicable, as it is incompatible with sparse solu¬ 
tions that set one or more rate constants equal to zero. 
For these reasons, we use a parameter vector, of length 
b = with 9 = {9^^\9^'^'^). The first (”) elements, 

notated encode the off-diagonal elements of S. The 
remaining n elements are notated and are used to 
construct the stationary distribution, tt. From S and 
TT, the off-diagonal and diagonal elements of K are then 
constructed from Eq. (22). In explicit notation, the con¬ 
struction is 


vech(S)j = i e {1,..., n(n — l)/2}, 
exp(6'p^) . 

TTi = -iS {!,...,nj, 

E;=iexp(0p^) 

_ f [D{y^)-^SD{y^)],j, i ^ j 

T=J, 


(23) 

(24) 

(25) 


where vech(A) is the row-major vectorization of the ele¬ 
ments of a symmetric nx n matrix above the main diag¬ 
onal, 

vech(A) — ■ z ^ 2 ^ 3 , . . . , 0,2^m ■ • ■ 5 ^n—l,n] 

(26) 


The necessary gradients of Eq. (25), dK.ijjd9u are 
sparse. For fixed 1 < u < ( 2 ), the nxn matrix /ddu 
over all i, j contains only four nonzero entries, whereas 
for ( 2 ) < u < ("p), the same matrix contains 3n — 2 
nonzero entries. The sum of its Hadamard product with 
Z in Eq. can thus be computed in 0 ( 1 ) or 0(n) time. 
For the remainder of this work, we focus exclusively on 
this reversible parameterization for K(0). 


C. Optimization 


Equipped with the log-likelihood and an efficient al¬ 
gorithm for the gradient, we now consider the construc¬ 
tion of maximum likelihood estimates, Eq. (12). Among 
the first-order quasi-Newton methods tested, we find 
Limited-memory Broyden-Fletcher-Goldfarb-Shanno op¬ 
timizer with bound constr aints (L-BFGS-B) to be the 
most successful and robust 

To begin the optimization, we choose the initial guess 
for 9 according to the following procedure. First, we fit 
the maximum likelihood reversible transition probability 
matrix computed using Algorithm 1 of Prinz et 


Next, we compute its principle matrix logarithm, K, us¬ 
ing an inverse scaling and squaring algorithm, and scaling 
by T.SSl Generally, the MLE reversible transition matrix 
is not embeddable, and thus the principle logarithm is 
complex or has negative off-diagonal entries, and does 
not correspond to any valid continuous-time Markov pro¬ 
cess. We take the initial guess from 9^'^'> directly from the 


stationary eigenvector of the MLE transition matrix, and 
9 {S) fj-Qjjj nearest (by Frobenius norm) valid rate ma¬ 
trix to K, given by max(Re(K), 0).^^ 

The optimization problem is non-convex in the gen¬ 
eral case and may have multiple local minima. Vary¬ 
ing the optimizer’s initialization procedure can thus mit¬ 
igate the risk of convergence to a low quality local min¬ 
imum. One alternative initialization K is the pseudo¬ 
generator, Kp = (T(r) —/ji)/r, which arises from a first- 
order Taylor approximation to the matrix exponential. 
After the optimization has terminated, a useful check is 
to compare the maximum likelihood transition matrix 
T(t) estimated during initialization with the exponen¬ 
tial of the recovered rate matrix, exp(TK'^'^^). Large 
differences between the two matricies, or their eigenspec- 
tra / relaxation timescales, may be symptomatic of non- 
embedability of the data or a convergence failure of the 
optimizer. If the data are available at a lag time shorter 
than T, convergence failures can often also be circum¬ 
vented by using a converged rate matrix obtained from a 
model at a shorter lag time as an initial guess for a model 
at a longer lag time. 


D. Implementation notes 


Because S is symmetric, it can be diagonalized effi¬ 
ciently at cost of 0(4n^/3) FLOPs. The eigenvectors can 
then be rotated by D{zJ^) to give the eigenvectors of K. 
Compared to diagonalizing the non-symmetric matrix K 
directly, this can yield a speedup of 2-lOx in the criti¬ 
cal diagonalization step required to compute the gradient 
vector. 

For each pair of states with an observed transition 
count, (i,j) such that C(T)ij> 0 , the gradient expres¬ 
sions Eq. (|l^, and Eq. (pf^ are only defined when 


Tij > 0. A sufficient condition to ensure this property 
is that K be irreduciblebut this cannot be straight¬ 
forwardly ensured throughout every iteration of the L- 
BFGS-B optimization without heavy-handed measures 
such as complete positivity of K. In practice, we find 
that replacing any zeros values in T with a small con¬ 
stant, such as I X I0“^°, when computing the matrix D 
in Eq. (19) is sufficient to avoid this instability. 


Furthermore, note that calculation of X(A,t) by direct 
implementation of Eq. © can suffer from a substantial 
loss of accuracy for close-lying eigenvalues. The matrix 
can instead be computed in a more precise manner using 
the exprel(x) = {e^ — \)/x or exmpl = — 1 routines, 

which are designed to be accurate for small x and are 
available in numerical libraries such as SLATEG, GSL, 
and the upcoming release of SciPy.l^^Hlll 


IV. QUANTIFYING UNCERTAINTY 

Since all data sets are finite, statistical uncertainty 
in any estimate of a probabilistic model is unavoidable. 
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Therefore, key quantities of interest beyond the maxi¬ 
mum likelihood rate matrix itself, 

are estimates of the sampling uncertainty in and 

estimates of the sampling uncertainty in quantities de¬ 
rived from such as its stationary eigenvector, tt, 

its eigenvalues, Xt, and relaxation timescales. 

In the large sample size limit, the central limit theorem 
guarantees that the distribution of 0^^^^ converges to a 
multivariate normal distribution with a covariance ma¬ 
trix which can be estimated by the inverse of the Hessian 
of the log-likelihood function evaluated at 0^^^ ^ assum¬ 
ing that the MLE does not lie on a constraint boundary.Ssl 
This can be thought of as a second order Taylor expan¬ 
sion for the log-likelihood surface at the MLE; the log- 
likelihood is approximated as a paraboloid with negative 
curvature whose peak is at the MLE and whose width is 
determined by the Hessian matrix at the peak. The ex¬ 
ponential of the log-likelihood, the likelihood surface, is 
then Gaussian, and the multivariate delta theorem can be 
used to derive expressions for the asymptotic variance in 
scalar functions of Computationally, the critical 

component is the computation of the Hessian matrix. 


H,,( 0 ;r) = 

n n 

= EEc 


(27) 


d'^C{0;T) 

39^30^ ’ 

( 3‘^T,^/30^39, (aT,,/90J(5T,,/50„)' 


TL 

^3 


(28) 


For example, the asymptotic variance in the stationary 
distribution can be calculated as 


Var(7rfc) « 


3'Ek 3^1; 


(31) 


where represent the lower nxn block of the asymp¬ 
totic variance covariance matrix and 


^ [tt, - 7rf, i=j, 
d9M j-7r,7rj , I ^ j. 


(32) 


Other key quantities of interest for biophysical appli¬ 
cations include the exponential relaxation timescales of 
the Markov model 


Ti = -{Xi) ^ ie{2,...,n}. (33) 

The asymptotic variance in the relaxation timescales, r^, 
is 


(34) 

uv 

where follows from standard expressions for deriva¬ 
tives of eigensystems,!^ 


^ = 1 r v 

39u X} [ 39u 


(35) 


and its inverse. 


A. Approximate Analytic Hessian 


Direct calculation of the Hessian requires both the 
evaluation of the first derivatives of T as well as the 
more costly second derivatives. A more efficient alter¬ 
native, as pointed out by Kalbfleisch and Lawless, is to 
approximate the second derivatives by estimates of their 
exp ect at ions .1^ 


Let Ci — Cij. 


Taking the expected value of 


Cij conditional on Ct, we approximate C^- 




This makes it possible to factor out of the summa¬ 
tion over j in Eq. (28), and exploit t he p roperty that 
3‘^Tij/39u30y = 0, simplifying Eq. (28) to 


H„„( 0 ;r) 


E 


Q 3T,^ 

T„ d9u 39y • 


(29) 


Equipped with the approximator Eq. (29), the asymp¬ 
totic variance-covariance matrix of 9 is calculated as the 
matrix inverse of the Hessian, S = H~^, and the asymp¬ 
totic variance in each derived quantity g{0) is estimated 
using the multivariate delta method.l^ 


Var(5(0)) « Vg{9^^^f S Vg{0^^^) (30) 


The sampling uncertainty in other derived properties 
which depend continuously on 0 can be calculated simi¬ 
larly. 

When the MLE solution lies at the boundary of the 
feasible region, with one or more elements of 0^'®^ equal 
to zero, we adopt an active set approach to approximate 
S. We refer to the elements of 0^^^ which do not lie on 
a constraint boundary as free parameters. The Hessian 
block for the free parameters is constructed and inverted, 
and the variance and covariance of the constrained ele¬ 
ments as well as their covariance with the free parameters 
is taken to be zero. 


V. NUMERICAL EXPERIMENTS 

We performed numerical experiments on three 
datasets, which demonstrate different aspects of our es¬ 
timator for continuous-time Markov processes. Where 
appropriate, we compare these models to reversible 
discrete-time Markov models which directly estimate 
T(r), parameterized via Algorithm 1 of Prinz et 


A. Recovering a Known Rate Matrix 

First, we constructed a simple synthetic eight state 
Markov process with known rates. The network is shown 
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FIG. 1. A simple eight state Markov process. Connected 
states are labeled with the pairwise rate constants, Kij. Self 
transition rates (not shown), Kii, are equal to the negative 
sum of each states outgoing transition rates, in accordance 
with Eq. Q. 


10 ° 

10 ’^ 

10 "^ 

10 '^ 

10 ^ 10 '‘ 10 =^ 10 ° 10 ^ 
Trajectory length, N 



FIG. 2. Convergence of the estimated rate matrix, K, to 
the true generating rate matrix in Fig. for discrete-time 
trajectories of increasing length simulated from the process in 
Fig. [^with a time step of 1. Using either a 2 norm (blue) or 
Frobenius norm (red), we see roughly power law convergence 
over the range of trajectory lengths studied. 


in Fig. The largest non-zero eigenvalue of K is A 2 « 
—9.40 X 10“^, which corresponds to a slowest exponential 
relaxation timescale, T 2 « 106.4 (arbitrary time units). 

From this model, we simulated discrete-time data with 
a collection interval of 1 time unit by calculating the ma¬ 
trix exponential of K and propagating the discrete-time 
Markov chain. In Fig. we show the convergence of the 
models estimated from this simulation data to the true 
model, as the length of the simulated trajectories grows. 
As expected, the fit parameters get more accurate as the 
size of the data set grows. We observe approximately 
power law convergence as measured by the 2-norm and 
Frobenius norm over the range of trajectory lengths stud¬ 
ied. 

The true rate matrix for this continuous time Markov 
process is sparse - only 7 of the 28 possible pairs of dis¬ 
tinct states are directly connected in Fig. Can this 
graph structure be recovered by our estimator? This task 


FIG. 3. Gomparison of the estimated and true off-diagonal 
rate matrix elements for a trajectory of length N = 10^ sim¬ 
ulated from the process in Fig. with a time step of 1. The 
true non-zero elements of K are well-estimated, as shown in 
the right portion of the plot; here, error bars are small enough 
to be fully obscured by point markers. On the other hand, the 
estimator spuriously estimates non-zero rates between many 
of the states which are not connected in the underlying pro¬ 
cess. However, the 95% confidence intervals for these spurious 
rates each overlap with zero. 


is challenging because of the nature of the discrete-time 
data. The observation that the system transitioned from 
state i (at time t) to state j (at time t + 1) does not 
imply that is non-zero. Instead, the observed i ^ j 
transition may have been mediated by one or more other 
states - the process may have jumped from i to k, and 
then again from ktoj, all within the observation interval. 

When the rate matrix, K is irreducible, the corre¬ 
sponding transition probability matrix T(r) is strictly 
positive for every positive lag time, r.^llThis implies that 
in the limit that the trajectory length, N, approaches in¬ 
finity, at least 1 transition count will almost surely be 
observed between any pair of states, regardless of the 
sparsity of K. 

In Fig. 1^ we attempt to resolve the underlying graph 
structure using the model estimated with a trajectory of 
length N = 10^. The plot compares the estimated rate 
matrix elements with the true values. We find that all of 
the true connections are well-estimated, and that many of 
the zero rates are also correctly identified. However, the 
maximum likelihood estimator also identihes very low, 
but non-zero rates between many of the states which are 
in fact disconnected. 

We computed 95% (1.96cr) confidence intervals for each 
of the estimated rate matrix elements, For each 

of the spuriously non-zero elements, these confidence in¬ 
tervals overlapped with zero. None of the confidence in¬ 
tervals for the properly non-zero rates overlapped with 
zero. These uncertainty estimates can therefore be used, 
in combination with the MLE, to identify the underlying 
graph structure. 

This example demonstrates that some degree of 
sparsity-inducing regularization or variable selection may 





























































7 



- 1.5 - 1.0 - 0.5 0.0 0.5 1.0 


FIG. 4. Brownian dynamics on the 2-dimensional Muller po¬ 
tential was discretized by projecting the simulated trajecto¬ 
ries onto an 8 X 8 grid. A typical trajectory is shown in black. 
The resulting discrete-state process can be approximated as 
a continuous-time Markov process. 



Theoretical Quantiles 


To assess the accuracy of the asymptotic approxima¬ 
tions, we compare the empirical distribution of the es¬ 
timated parameters over the separate data sets with 
the theoretical distribution which would be expected 
based on the Gaussian approximation. Consider a 
scalar model parameter g, such as one of the relaxation 
timescales or equilibrium populations. Fitting a model 
separately on each of the twenty data sets yields esti¬ 
mates, {(ffi, crlJ,... (^ 20 , crl^o)}- If these estimates are 
accurate, then g is normally distributed, g ^ J\f{g,a?). 
Our goal is to examine the accuracy of the estimated 
variances, (t|. Note that the true value of g is unknown, 
but subtracts out when examining standardized differ¬ 
ences between the estimates, which, assuming normality, 
should follow a standard normal distribution, 

Zij = ~ ~ A7(0,1). (36) 


In Fig. we compare the empirical and theoretical 
distributions of {i,j) : 1 < i < 20, * < j < 20, for 
estimates of the first two relaxation timescales using a 
quantile-quantile (Q-Q) plot, a powerful method of com¬ 
paring distributions. The observation that Q-Q plot runs 
close to the y = x line is encouraging, and shows that the 
observed deviates are close to normally distributed, and 
that the approximator’s variance estimates are of the ap¬ 
propriate magnitude. This suggests that the asymptotic 
error expressions can be of practical utility for practition¬ 
ers. 


FIG. 5. Quantile-quantile plot of the standardized differ¬ 
ences, Eq. (361, between estimated relaxation timescales, T 2 
and ra, on twenty i.i.d. datasets. If the estimated timescales 
are normally distributed with the calculated asymptotic vari¬ 
ances, the quantiles of their standardized differences would 
match exactly with the theoretical quantiles of the standard 
normal distribution. 


be required to robustly identify the underlying graph 
structure in Markov process. 


B. Accuracy of Uncertainty Estimates 

How accurate are the approximate asymptotic uncer¬ 
tainty expressions derived in Section [TV] ? To answer this 
question, we performed a numerical experiment with 
twenty independent and identically distributed collec¬ 
tions of trajectories of Brownian dynamics on a two- 
dimensional potential. One of those trajectories is shown 
superimposed on the potential in Fig. along with the 
8 x8 grid used to discretize the process. The Brown¬ 
ian dynamics simulations were performed following the 
same procedure described in McGibbon, Schwantes and 
Pande.l^ 


C. Comparison with discrete-time MSMs 


In a data-limited regime, are continuous-time Markov 
models more capable than discrete-time MSMs? We ex¬ 
tended the analysis in Section |V A| to a larger class of 
generating processes in order to address this question. 
We began by sampling random 100-state Markov process 
rate matrices from scale free random graphs.!^. Details 
of the random rate matrix generation are described in 
Appendix 

From each random rate matrix, K, we sampled three 
discrete-time trajectories of different lengths. Each tra¬ 
jectory was used individually to fit both a continuous¬ 
time and discrete-time Markov model, and the param¬ 
eterized models were then compared to the underlying 
system from which the trajectories were simulated to as¬ 
sess the convergence properties of the approaches. 

In Fig. we consider two notions of error. The first 
norm measures error in the elements of the estimated 
transition matrix, ||T — T||i?. Unlike the experiment in 
Fig. we used the T as the basis of the measure so that 
the continuous-time and discrete-time models could be 
compared on an equal footing. The second error norm 
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FIG. 6. Violin plots of the relative error between continuous¬ 
time and discrete-time Markov models for kinetics on ran¬ 
dom graphs. Values below zero indicate lower error for the 
continuous-time model, whereas values above zero indicate 
the reverse. The shape displays the data density, computed 
with a Gaussian kernel density estimator. Panel (a): as mea¬ 
sured by the Frobenius-norm error in the estimated transition 
matrices, ||T—T||i?, the continuous-time model achieves lower 
errors, with a larger advantage for shorter trajectories. Panel 
(b): as measured by the max-norm error in the estimated 
relaxation timescales, max; |fi — Ti\, the two models are not 
distinguishable. 


we consider is the max-norm error in the estimated relax¬ 
ation timescales, max^ |fi — T^j, which measures a critical 
spectral property of the models. In both panels of Fig. 
the distribution of the difference in error between the 
continuous-time and discrete-time models is plotted; val¬ 
ues below zero indicate that the continuous-time model 
performed better for a particular class of trajectories, 
whereas values above zero indicate the reverse. For each 
condition, we performed N = 30 replicates. 

Our results show that as measured by the transition 
matrix error, the continuous-time Markov process model 
is more accurate in the regimes considered. A binomial 
sign test rejects the hypothesis that the two estimators 
give the same error for all three conditions (two-sided p 
values of [2 x 10“®,2 x 10 “®,! x 10“^] for trajectories 
of length [10^,10^,10®] steps, respectively). The rela¬ 
tive advantage of the continuous-time Markov model de¬ 
creases as the trajectory length increases - its advantage 
is in the sparse data regime when no transition counts 
have been observed between a significant number of pairs 
of states. 

In contrast, as measured by the relaxation timescale 
estimation error, we observe no significant difference be¬ 
tween the continuous-time and discrete-time estimators. 
A binomial sign test does not definitively reject the hy¬ 
pothesis that the two estimators give the same error 
for any of the three conditions (two-sided p values of 
[0.02,0.36,0.85] for trajectories of length [10®, 10^, 10®] 
steps, respectively). Neither estimator is consistently 
more accurate in recovery of the dominant spectral prop¬ 



FIG. 7. The FiP35 WW protein, in its native state. We 
analyzed two lOOps MD trajectories of its folding performed 
by D.E. Shaw Research to estimate a Markov process model 
for its conformational dynamics.^ 


erties of the dynamics. 


D. Application to Protein Folding and Lag Time Selection 

How can these models be applied to the analysis of 
molecular dynamics (MD) simulations of protein folding? 
We obtained two independent ultra-long 100/rs MD sim¬ 
ulations of the FiP35 WW protein,!^ a small 35 residue 
/?-sheet protein (Fig. [^, performed by D.E. Shaw Re¬ 
search on the ANTON supercomputer.™ 

In order to focus on the construction of discrete-state 
Markov models, we initially projected every snapshot of 
the MD trajectories, which were available at a 200 ps 
time interval, into a discrete state space with 100 states 
in a way consistent with prior work.^^ Briefly, this in¬ 
volved the extraction of the distance between the clos¬ 
est non-hydrogen atoms in each pair of amino acids in 
each simulation snapshot,^ followed by the application of 
time-structure independent components analysis (tICA) 
to extra ct th e four most slowly decorrelating degrees of 
freedomj ®® l ®® l which were th en clu stered into 100 states 
using the A:-means algorithm .1^31111 


Although the equations of motion for a protein’s dy¬ 
namics in an MD simulation are Markovian, the gener¬ 
ating process of the data analyzed by our model is not. 
The pre-processing procedure which projects the original 
dynamics from a high-dimensional continuous state space 
(the position and momenta of the constituent atoms) into 
a lower dimensional continuous space or discrete state 
space is not inf ormat ion preserving, and destroys the 
Markov property! ®® ! ^® ! For chemical dynamics, qualitative 
features of the non-Markovianity are well-understood. 
Consider, for example, a metastable system with two 
states, A and B, the system in state A may stochasti¬ 
cally oscillate across the boundary surface many times 
without committing to state B. Whereas for a Markov 
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FIG. 8. Implied exponential relaxation timescales of parameterized (a) continuous-time Markov process model and (b) discrete¬ 
time Markov model as a function of lag time. The relaxation timescales computed by the two algorithms coincide almost exactly 
(r^ = 0.999978). (c) The number of free (non-zero) parameters estimated by the discrete-time and continuous-time models 
respectively; the continuous-time Markov model achieves a more parsimonious representation of the data. 



FIG. 9. The maximum likelihood rate matrix, K, computed 
at a lag time, r, of 100 ns. The state indices were sorted in 
seven macrostates using the PGGA algorithm.^ 


FIG. 10. Gonvergence of selected rate matrix elements as a 
function of lag time. A plausible method for lag time selection 
would be to choose r such that some or all of these entries 
are determined to have plateaued. 


process, the probability distribution of the waiting time 
that the system spends in any states before exiting is 
exponential, chemical dynamics are expected to show a 
higher propensity for short waiting times, corresponding 
to so-called recrossing events.!^ This effect is more pro¬ 
nounced when viewing the process at short lag times - 
the bias induced by approximating the process as Markov 
decreases with lag time.l^ 

For the FiP35 WW domain, we observe that the 
change in the relaxation timescales of the continuous¬ 
time and discrete-time Markov models with respect to 
lag time are essentially identical, as shown in Fig. For 
both model classes, the estimated relaxation timescales 
increase and converge with respect to lag time. This 
is consistent with our results in Fig. (b), which sug¬ 
gest that the estimated timescales are the same for both 
models, especially as the length of the trajectories grow. 
While fitting the models in Fig. we observed a small 
number (2-4) of convergence failures at long lag times, 
which were notable due to a dramatic discontinuinity in 
the relaxation timescales curve. This problem was solved 


by reinitializing the optimization at these lag times from 
the converged solutions at adjacent lag times. 

Because of the essentially unchanged nature of the re¬ 
laxation timescale spectrum, we suggest that when choos¬ 
ing a particular lag time, the same approach be used for 
discrete-time and continuous-time Markov models. Ide¬ 
ally, this entails the selection of a lag time large enough 
that the relaxation timescales are independent of lag 
time.lSSMl For the continuous-time Markov model, other 
techniques may be appropriate as well. For example, 
in Fig. IT we show the convergence of selected diago¬ 
nal entries of the rate matrix as a function of lag time. 
As described in the context of transition state theory, 
these rate constants should plateau with increasing r, 
which pro vides another related basis on which select the 

p aramet er 

The most significant difference between the 
continuous-time and discrete-time estimators in this 
case is the sparsity of the parameterized models. In 
Fig.m; c), we compare the number of non-zero indepen- 















































































10 


dent parameters for both models as a function of r. Of 
the (”J^) = 5050 independent parameters for both the 
continuous-time and discrete-time models, only « 1200 
are nonzero for the continuous-time model, regardless of 
lag time. In contrast, the number of nonzero parameters 
for the discrete-time model model continues to increase 
with lag time. 

We anticipate that the sparsity of K may aid in the 
analysis and interpretation of Markov models. In Fig. 
we show the MLE rate matrix computed at r = 100 ns. 
The state indices were sorted such that states grouped 
together via Peron cluster cluster analysis (PCCA) were 
given adjacent indices.l^ The evident block structure of 
the matrix visually indicates that the protein’s conforma¬ 
tional space consists of a small number of regions with 
generally high within-region rate constants, but weak 
between-region coupling. Although a detailed analysis 
of the biophysics of these conformations is beyond the 
scope of this work, visual analysis of these structures in¬ 
dicate that the model resolves folded and unfolded, as 
well as partially folded intermediate states. 

In interpreting the 1.96a error bars on the relaxation 
timescales in Fig.j^a), cautionary note is warranted. Our 
error analysis considers the number of observed tran¬ 
sitions between states but does not take into account 
any notion of uncertainty in the proper definitions of 
the states themselves, or the error inherent in approx¬ 
imating a non-Markovian process with a Markov pro¬ 
cess. The observation in Fig. [^a) that the magnitude 
of the systematic shift in the timescales with respect to 
lag time is much larger than the error bars suggests that 
the Markov approximation (a model misspecification) is 
a larger source of error, for this dataset, than the statisti¬ 
cal uncertainty in model parameters. For these reasons, 
we caution that these error bars should be interpreted as 
lower bounds rather than upper bounds. 



Number of states 


FIG. 11. Performance of our Markov process estimator, as 
compared to the Holmes-Rubin EM estimator.^ Each itera¬ 
tion of our 0(n^) estimator takes on the order of 1 ms, while 
the 0{n^) Holmes-Rubin estimator takes over 10 seconds per 
iteration for a 100 state model. Using default convergence 
criteria, our estimator often achieves a solution long before 
the EM estimator finishes a single iteration. 

O(n^) vs. O(n^) scaling, the performance difference be¬ 
tween the algorithms is substantial. For n = 100, our al¬ 
gorithm is roughly four orders of magnitude faster per it¬ 
eration; our algorithm takes on the order of 1 ms per iter¬ 
ation, while the Holmes-Rubin estimator’s iteration takes 
over 10 seconds. Using the L-BFGS-B optimizer’s default 
convergence criteria, roughly three quarters (68/91) of 
the runs of our algorithm converge in fewer than 100 it¬ 
erations; a solution is often achieved long before the EM 
estimator has performed a single iteration. 

VI. CONCLUSIONS 


E. Performance 


In order to assess the performance of our maximum 
likelihood estimator, we compared it with an algorithm 
by Holmes and Rubin, which solves the same Markov 
process parameterization problem using an expectation- 
maximization approach.l^ Because the original code was 
unavailable, we reimplemented the algorithm following 
the description by Metzner et ah, where it is denoted 
“Algorithm 4: Enhanced MLE-method for the reversible 
case” .1^ The algorithm scales as O(n^), where n is the 
number of states. Its rate limiting step involves an O(n^) 
FLOP contraction of five n x n matrices into a rank-4 
tensor of dimension n on each axis.^For benchmarking, 
we constructed a variant of the the FiP35 WW protein 


dataset from Section VD in which we varied the num¬ 


ber of states between 10 and 100 during clustering. All 
models were fit on an Intel Xeon E5-2650 using a single 
CPU core. 

As shown in Fig. mi and expected on the basis of the 


In this work, we have introduced a maximum likeli¬ 
hood estimator for continuous-time Markov processes on 
discrete state spaces. This model can be used to estimate 
transition rates between various substates in a dynamical 
system based on observations of the system at a discrete 
time interval. Various constraints on the solution, such 
as detailed balance, can be easily incorporated into the 
model, and asymptotic error analysis can give confidence 
intervals in model parameters and derived quantities. 

With the efficient parameterization problem solved, 
these continuous-time Markov models offer several ad¬ 
vantages over existing MSM methodologies. As com¬ 
pared to discrete-time MSMs, these models are more in¬ 
terpretable for chemists and biologists because they do 
not arbitrarily discretize time. Although a lag time is 
used internally during parameterization, the final esti¬ 
mated quantities are familiar rate constants from chem¬ 
ical kinetics, as opposed to the somewhat unintuitive 
transition probabilities in a discrete-time MSM. Further¬ 
more, these models are more parsimonious, and unlike 
the discrete-time MSM are able to detect that many pairs 
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of states are not immediately kinetically adjacent to one 
another. This makes it possible to more clearly recover 
the underlying graph structure of the kinetics. For appli¬ 
cations such as the determination of transition pathways 
in protein dynamics, we anticipate that this property will 
be valuable. 

Many extensions of this model are possible in fu¬ 
ture work. The simple nature of the constraints on 
9 make Bayesian approaches, especially Hamiltonian 
Monte Carlo, particularly attractive.^^In particular, be¬ 
cause of the separation of 9^'^'> and 9^^'^ in the parame¬ 
terization, strong informative priors on tt may be added 
to extend the work of Trendelkamp-Schroer and Noe.l^ 
The appropriate sparsity inducing priors on 9^^'> may be 
a topic of future work. 

An implementation of this estimator is available in the 
MSMBuilder software package at http; //msmbuilder. 
org/ under the GNU Lesser General Public License. 
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Appendix A: Random rate matrices 


Scale-free random graphs with 100 states were gener¬ 
ated using the Barabasi-Albert preferential attachment 
model with m = S.^From the graph’s adjacency matrix, 
we generated a symmetric rate matrix S by sampling 
a log-normally distributed random variable (/i = —3, 
cr = 2) for each connected edge. The stationary distribu¬ 
tion, TT, was sampled from Dirichlet(a = 1). The matrix 
S was then scaled by 50 • ^ which tuned the 


relaxation timescales in the range between 10^ and 10^ 
time steps, and used with tt in Eq. (25) to construct K. 
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